
Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation #800

Open

polarG wants to merge 5 commits into main from dev/hongtaozhang/fix-num_workers-usage

Conversation

@polarG polarG (Contributor) commented Apr 1, 2026

Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.

This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.
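The clamp described above can be sketched as a small helper (the name `preprocess_workers` is hypothetical; per the description, the actual change applies `max(1, self._args.num_workers)` inline when building the preprocess command):

```python
def preprocess_workers(num_workers: int) -> int:
    """Clamp the worker count passed to preprocess_data.py.

    multiprocessing.Pool(processes) requires processes >= 1, while
    DataLoader treats num_workers=0 as "load in the main process".
    Only the preprocessing subprocess sees the clamped value; other
    consumers keep the original num_workers unchanged.
    """
    return max(1, num_workers)
```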

@polarG polarG requested a review from a team as a code owner April 1, 2026 06:07
Copilot AI review requested due to automatic review settings April 1, 2026 06:07

Copilot AI left a comment


Pull request overview

This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).

Changes:

  • Clamp --workers for preprocess_data.py to max(1, self._args.num_workers) to avoid multiprocessing.Pool(0) failures.
  • Add inline clarification comment documenting why preprocessing clamps workers while other components may allow num_workers=0.


Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py
Comment thread superbench/benchmarks/model_benchmarks/megatron_gpt3.py Outdated
@polarG polarG self-assigned this Apr 17, 2026
@polarG polarG added enhancement New feature or request ROCm labels Apr 27, 2026
- Megatron's preprocess_data.py appends '_text_document' to the
  --output-prefix when producing the .bin/.idx files. Derive the
  output-prefix from --data_prefix (stripping a trailing
  '_text_document' suffix when present) so that the generated files
  match the existence checks for any custom data_prefix value, instead
  of being hardcoded to '<data_home>/dataset'.
- Add unit test test_megatron_gpt_dataset_generate_command covering:
  num_workers=0 clamps to '--workers 1' with default data_prefix
  ('dataset_text_document' -> '<data_home>/dataset'), num_workers=4
  with custom 'custom_text_document' -> '<data_home>/custom', and a
  data_prefix without the suffix used as-is.
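The output-prefix derivation described in the first bullet can be sketched as follows (the helper name `derive_output_prefix` is an assumption; the change computes this inline, and the empty-basename guard comes from a later commit in this PR):

```python
import os

def derive_output_prefix(data_home: str, data_prefix: str) -> str:
    """Derive --output-prefix from data_prefix.

    Megatron's preprocess_data.py appends '_text_document' to
    --output-prefix when writing the .bin/.idx files, so strip that
    suffix here; a data_prefix without the suffix is used as-is.
    """
    basename = data_prefix
    if basename.endswith('_text_document'):
        # Fall back to the original value when stripping would leave an
        # empty basename (i.e. data_prefix == '_text_document').
        basename = basename[:-len('_text_document')] or basename
    return os.path.join(data_home, basename)
```

For example, 'dataset_text_document' maps to '<data_home>/dataset' and 'custom_text_document' to '<data_home>/custom'.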
Copilot AI review requested due to automatic review settings April 27, 2026 22:02
Avoid '--workers 1' matching '--workers 10' etc.
@polarG polarG force-pushed the dev/hongtaozhang/fix-num_workers-usage branch from a475b85 to ee2839e Compare April 27, 2026 22:03

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



Comment thread tests/benchmarks/model_benchmarks/test_megatron_gpt.py Outdated
@codecov

codecov Bot commented Apr 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.85%. Comparing base (036c471) to head (ee2839e).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   85.69%   85.85%   +0.15%     
==========================================
  Files         103      103              
  Lines        7890     7894       +4     
==========================================
+ Hits         6761     6777      +16     
+ Misses       1129     1117      -12     
Flag                       Coverage Δ
cpu-python3.12-unit-test   70.59% <100.00%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown.


Hongtao Zhang added 2 commits May 2, 2026 00:19
…et generate test

- Replace brittle whitespace substring assertions ('--workers 1 ', '--output-prefix ... ') with normalize_command()-based parsed CLI unit checks, so the test validates semantics rather than formatting.
- Use --code_base {self._tmp_dir} together with createMockFiles(['pretrain_gpt.py']) to avoid the unrealistic /root/Megatron-DeepSpeed path. The mocked run_command now creates the expected .bin/.idx files via side_effect so _preprocess() succeeds end-to-end and is asserted to be True.
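The normalize_command()-based parsed-CLI check mentioned above can be approximated as follows; a minimal sketch (the helper name `parse_cli` and its dict shape are assumptions, not the test's actual API):

```python
import shlex

def parse_cli(command: str) -> dict:
    """Parse '--flag value' pairs into a dict so tests assert on
    semantics (flag -> value) rather than raw substrings, avoiding
    '--workers 1' accidentally matching '--workers 10'."""
    tokens = shlex.split(command)
    args: dict = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith('--') and i + 1 < len(tokens) and not tokens[i + 1].startswith('--'):
            args[tok] = tokens[i + 1]
            i += 2
        else:
            # Bare flags and positional tokens are recorded as present.
            args.setdefault(tok, True)
            i += 1
    return args
```

A test can then assert `parse_cli(cmd)['--workers'] == '1'` with no false positive when the command actually contains '--workers 10'.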
- Warn when num_workers is silently clamped from 0 to 1 for the
  preprocess subprocess so the user sees the override in the log.
  The DataLoader still receives the original num_workers value.
- Guard the '_text_document' suffix-strip against the edge case where
  data_prefix == '_text_document' (which would otherwise produce a
  malformed --output-prefix '<data_home>/' with an empty basename).
  Fall back to the original data_prefix value in that case.
- Extend test_megatron_gpt_dataset_generate_command with a 4th case
  asserting the empty-basename fallback.
Copilot AI review requested due to automatic review settings May 3, 2026 05:05

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



Comment on lines +656 to +666

    # when producing the .bin/.idx files. Derive the output-prefix from
    # data_prefix (stripping the '_text_document' suffix when present) so that
    # the generated files match the existence checks above for any custom
    # data_prefix value.
    output_prefix_basename = self._args.data_prefix
    if output_prefix_basename.endswith('_text_document'):
        stripped = output_prefix_basename[:-len('_text_document')]
        # Guard against data_prefix == '_text_document' which would
        # leave an empty basename and produce a malformed --output-prefix
        # ending in '/'. Fall back to the original value in that case.
        output_prefix_basename = stripped or output_prefix_basename
Comment on lines +243 to +260

    # Case 3: data_prefix without the '_text_document' suffix is used as-is.
    _run_case(
        extra_params='--num_workers 2 --data_prefix mydata',
        expected_workers=2,
        expected_prefix_basename='mydata',
        expected_data_prefix='mydata',
    )

    # Case 4: edge case - data_prefix == '_text_document' should NOT strip down
    # to an empty basename (which would produce '--output-prefix <data_home>/').
    # Fall back to using '_text_document' as the basename.
    _run_case(
        extra_params='--num_workers 1 --data_prefix _text_document',
        expected_workers=1,
        expected_prefix_basename='_text_document',
        expected_data_prefix='_text_document',
    )


Labels

enhancement New feature or request ROCm

2 participants